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Abstract 

Background: Relationships between species, genes and genomes have been printed as trees for over a century. 
Whilst this may have been the best format for exchanging and sharing phylogenetic hypotheses during the 20 th 
century, the worldwide web now provides faster and automated ways of transferring and sharing phylogenetic 
knowledge. However, novel software is needed to defrost these published phylogenies for the 21 st century. 

Results: TreeRipper is a simple website for the fully-automated recognition of multifurcating phylogenetic trees 
(http://linnaeus.zoology.gla.ac.uk/~jhughes/treeripper/). The program accepts a range of input image formats (PNG, 
JPG/JPEG or GIF). The underlying command line C++ program follows a number of cleaning steps to detect lines, 
remove node labels, patch-up broken lines and corners and detect line edges. The edge contour is then 
determined to detect the branch length, tip label positions and the topology of the tree. Optical Character 
Recognition (OCR) is used to convert the tip labels into text with the freely available tesseract-ocr software. 32% of 
images meeting the prerequisites for TreeRipper were successfully recognised, the largest tree had 115 leaves. 

Conclusions: Despite the diversity of ways phylogenies have been illustrated making the design of a fully 
automated tree recognition software difficult, TreeRipper is a step towards automating the digitization of past 
phylogenies. We also provide a dataset of 100 tree images and associated tree files for training and/or 
benchmarking future software. TreeRipper is an open source project licensed under the GNU General Public 
Licence v3. 



Background 

In 1859, Darwin produced one of the first illustrations of a 
phylogenetic tree, notably this was the only figure included 
in The Origin of Species [1]. Since, biologists have used 
trees to depict the relationships between organisms, genes 
and genomes. The number of studies depicting phyloge- 
nies exploded (see Figure 1) with the development of the 
polymerase chain reaction technique and journals were 
created specifically for publishing the molecular phyloge- 
nies generated by researchers (e.g., Molecular Phyloge- 
netics and Evolution established in 1992). Whilst in the 
early years of morphological and molecular phylogenetics, 
embedding illustrations into manuscripts might have been 
the most appropriate way to disseminate knowledge, this 
has resulted in the locking up of phylogenetic hypotheses 
into the pages of journals and books without an easy way 
to access this information. 
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Currently, the construction of the relationships 
between the 1.8 million currently estimated species lar- 
gely depends on the unprecedented growth of molecular 
sequence data [2] and this makes GenBank the most 
accessible source of comparative data for most taxa in 
the tree of life [3]. Whilst more sequence data, more 
powerful computers and improved phylogenetic recon- 
struction algorithms will enable researcher to generate 
up-to-date phylogenies from the raw data in the future, 
past phylogenetic inferences will remain central to guid- 
ing researchers towards studying poorly supported rela- 
tionships and under-sampled lineages. They are also 
central for studying the effects of new phylogenetic 
methodologies and new and larger datasets [2]. 

Not all phylogenetically informative data are confined 
to sequence databases. TreeBASE is a very valuable 
repository as it holds morphological or genetic data with 
the associated published phylogeny [4]. However, as few 
publishers require submission to TreeBASE as a pre- 
requisite for publication, a large number of phylogenies 
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Figure 1 Percentage of articles with phylogen* in the title. The 

percentage of articles with phylogen* in the title out of the total 
number of publication for each year since 1980 from PubMed. 



remain embedded as images in published articles. 
Indeed, the rapid growth of published phylogenies is not 
matched by the availability of those trees in databases 
(see Figure 1 in [5]). 

The idea of using a program to convert a tree image into 
a computer-readable representation of that tree was first 
implemented in TreeThief [6] which required the user to 
trace a tree by clicking on each of its nodes in turn. The 
latter program is only available for the discontinued oper- 
ating system Mac OS 9. Laubach and von Haeseler [7] 
provided a conceptual advance with a semi-automatic pro- 
gram called TreeSnatcher that has recently been updated 
[8]. TreeSnatcher uses image-processing methods to pre- 
pare a tree image and detect the tree structure, it works 
on rectangular and freeform trees (e.g., radial and star). 
The user supervises the tree recognition process by mak- 
ing corrections to the image. For example, the user can 
modify the image in order to make the foreground dark 
and background light, fill gaps in lines and identify the 
foreground. The program then determines inner node and 
tip locations. The user can add or remove further nodes 
and delete or add branches. The user is then required to 
assign species names to the tips before the program can 
build the Newick tree code. 

Here, we will review the way researchers present their 
phylogenies, demonstrate the feasibility of a fully auto- 
mated tree recognition software and provide a dataset of 
tree images and associated tree files for training and/or 
benchmarking future programs. 

Implementation 

The current version of TreeRipper opens tree-image 
files in the formats PNG, JPG/JPEG, or GIF. 



♦ The tree needs to have the root on the left and 
leaves on the right. 

♦ Horizontal branches. 

♦ The tree constitutes a dark foreground on a light 
homogenous background (no background boxes or 
shading). 

♦ The tree must be bi- or multifurcating (not a 
network) 

♦ The inner nodes are branching points between 
lines and have no circles, rectangles, etc. inscribed. 

♦ Tip branches must have branch lengths greater 
than 0. 

TreeRipper is written in C++ using a set of Standard 
Template Library algorithms provided by Magick++. The 
image is first converted to black and white and rescaled 
so that horizontal lines are on average 2 pixels thick. The 
image is cleaned by removing a series of patterns such as 
black pixels surrounded by a box of white pixels and hor- 
izontal lines that are not connected to vertical lines. 
Lines and corners are then patched up before the contour 
is traced and the topology detected. The locations of 
branch tips are then used to crop the tip labels from the 
original image. Tip labels are converted to text using the 
freely available tesseract-ocr program. The steps in the 
program are depicted in Figure 2. The web application 
written in PHP enables the visualization of the tracing 
and allows editing of the labels. 

Results and Discussion 

We downloaded 322 images which had phylogen* or 
supertree in their caption from 249 articles published in 
the Open Access journal BMC Evolutionary Biology 
between 1997 and 2009. Only eleven out of these 249 
articles have submitted their alignment and tree files to 
TreeBASE. All images were visually inspected to check 
whether the image met the prerequisites. Twenty-four 
images were not phylogenies, 26 were represented as 
radial tree layouts, 8 as polar tree layouts and 5 as cla- 
dograms. Of those represented with a rectangular tree 
layout, 40 had background boxes, 31 had lines intersect- 
ing branches or branches drawn with dotted or dashed 
lines, 32 had circles or boxes as nodes, 6 were illustrated 
over multiple pages, 4 had triangles as tip leaves, 3 had 
leaves with zero branch lengths. A further 29 would 
need some form of pre-processing (rotating or splitting 
into component images). Of the 298 images of phyloge- 
nies downloaded only 114 (38%) met the prerequisites 
for this program, which are very similar to those of the 
original semi-automatic recognition software TreeS- 
natcher [7]. This small proportion of the total phyloge- 
netic images illustrates the plethora of ways trees are 
currently represented in one journal alone. Of the 114 
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Input image 



Cleaned image Contour detection 



- Marsupialia 





#NEXUS 
Begin TREES; 

Translate 

1 'Marsupialia', 

2 'Xenarthra', 

3 'Scandenha', 

4 'Primates', 

5 'Eulipctyphla', 

6 'Afrosoricida', 

7 'Tubulidcntata', 

8 'Macroscclidca', 

9 'Hyraccidca', 

10 'Sirenia', 

11 'Probcscidea' 



TREE TREERIPPER = (1 ,(2,(((3,4),5),(6,((7,8),(9,(10,11 ))))))); 



Conversion to 
text format 



End; 

Figure 2 Architecture of the software design for TreeRipper. The input image is scaled, node labels are removed, branches are smoothed 
and corners patched-up, the contour is detected. Tips locations are used to determine leaf label boxes for which the text is recognised using 
Tesseract. TreeRipper summarizes the tree topology and labels in a text file and an SVG file, which shows the contours. 



phylogenies that meet the prerequisites, the topologies 
of 37 trees (i.e., 32%) were successfully recognized by 
TreeRipper without any prior processing. The propor- 
tion of successfully recognised images was higher for 
phylogenies with fewer leaves (Figure 3) and the largest 
phylogeny successfully recognised had 115 leaves. The 
average processing time was 127 seconds (ranging from 
4 to 562 seconds) using a MacBook Pro (2.4 GHz Intel 
Core 2 Duo with 2 GB 667 MHz DDR2 SDRAM). We 
do not review the accuracy of the OCR here as it has 
been done elsewhere (see [9]). 

The successfully recognised tree images along with a 
further 63 images manually converted to tree files are 
provided as supplementary material in NEXUS, Newick 
and phyloXML formats [10] (Additional file 1) for train- 
ing and/or benchmarking future programs. 



Conclusions 

Although the program has a high failure rate, it is the 
first step towards an automated approach for optical 
tree recognition and proves the feasibility of an 
approach, which will allow us to defrost published phy- 
logenetic hypotheses. We are unlikely to ever be able to 
create an application that recognises all possible trees 
due to the sheer diversity of ways phylogenies have been 
illustrated but at the very least, this program could be 
used for automating tree recognition of large sets of tree 
images before using manual conversion or semi-auto- 
mated programs like TreeSnatcher for the trees that 
were not converted. 

As phylogenetics enters a third phase of growth with 
the advent of next-generation sequencing, one hopes 
that the work of future phylogenetists will be published 
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Figure 3 Proportion of images successfully recognised. The 

proportion of tree images successfully recognised by TreeRipper 

according to the number of leaves on the phylogeny. 
\ ) 

in a format that will enable the digital curation and pre- 
servation of their hard work. 

Availability and requirements 

Project name: TreeRipper (automated phylogeny recog- 
nition from images) 

Webserver: http://linnaeus.zoology.gla.ac.uk/~jhughes/ 
treeripper 

Project home page: https://code.google.eom/p/ 
treeripper/ 

Programming language: C++ and PHP web interface 
License: GNU GPL v3 
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Prerequisites 

Tesseract-OCR licensed with the Apache 2.0 License 
except the tesseractTrainer.py, which is licensed with 
GPL: http://code.google.eom/p/tesseract-ocr 

Imagemagick, license is compatible with the GPL: 
http://www.imagemagick.org/ 

Additional material 



Additional file 1: Tree images, associated newick file and example 
Perl script for batch processing. Set of images and associated nexus 
tree file as a zip file. 
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